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1. INTRODUCTION 

Good database design is crucial to obtain a sound, 
consistent database, and — in turn — good database 
design methodologies are the best way to achieve 
the right design. These methodologies are taught 
to most Computer Science undergraduates, as part 
of any Introduction to Database class [33[. They 
can be considered part of the "canon", and indeed, 
the overall approach to database design has been 
unchanged for years. Moreover, none of the ma- 
jor database research assessments identify database 
design as a strategic research direction [H, 0, 0] ■ 

Should we conclude that database design is a solved 
problem? 

Our thesis is that database design remains a crit- 
ical unsolved problem. Hence, it should be the sub- 
ject of more research. Our starting point is the 
observation that traditional database design is not 
used in practice — and if it were used it would result 
in designs that are not well adapted to current en- 
vironments 0. In short, database design has failed 
to keep up with the times. In this paper, we put 
forth arguments to support our viewpoint, analyze 
the root causes of this situation and suggest some 
avenues of research. The point of view espoused 
here has been put forth more or less explicitly in 
other places (see [11] for a recent and notable exam- 
ple); but here we put together several strands that 
have received isolated attention, and focus them on 
an issue that we feel is particularly important — 
database design. 

In the next section (§ HI), we sketch the tradi- 
tional database design process: we argue that it 
manages to be, at the same time, over- engineered 
and under- engineered. The contradiction is only ap- 
parent: as any complex problem, this one is multi- 
faceted. Traditional design does too little with re- 
spect to some areas and too much with respect to 
others. In § [3l we analyze the causes of the prob- 
lems presented in § [2l We then briefly the current 
status of research on database design (§[4|. Finally, 



we present some ideas for a research renewal in § [5] 

2. TRADITIONAL MODELING 

Relational modeling is usually broken down into 
three steps: 

• Conceptual modeling, which includes re- 
quirement gathering and specification, and re- 
sults in a conceptual model of the database. 
At this stage, the designer focuses on issues of 
scope — what belongs in the database ? — and 
organization — houj is the information to be 
structured? Entity- relationship diagrams [l5| 
and UML class models are the two best known 
conceptual models, but not the only ones; al- 
ternatives like Object Role Modeling have been 
proposed [iot . 

• Logical modeling, which takes as input the 
conceptual model produced in the previous step 
and yields a database schema. This step is well 
developed (60j . Normalization enforces func- 
tional dependencies by removing redundancy. 

• Physical modeling, which takes as input the 
database schema produced in the previous step 
and produces storage structures to implement 
the schema in computer systems. It can be 

ii,[2i|. 



automated to a large extent 



Each step focuses on only one aspect of the prob- 
lem which helps tame the complexity. Also, each 
step produces an output that feeds into the next 
step, creating a linear structure that is easy to fol- 
low. 

2.1 Problems with the traditional approach 

The problem of database design is difficult, and 
it encompasses issues that may not be amenable 
to formalization (s^ . Hence, any method is likely 
to have some limitations and drawbacks. However, 
this is not a reason to ignore the serious problems 
that the traditional approach is running into. Here 



we summarize what we see, from our experience and 
perspective, as the most troublesome issues. 

2.1.1 Failures of use and guidance 

We claim that the traditional approach is not fol 



lowed in practice. Indeed, Fitzgerald et al. [23 1 
found that only about 11% of the consulted or- 
ganizations claimed using an unmodified commer- 
cial information system methodology. Furthermore, 
Brodie and Liu [11| report that while 90% of all in- 
formation systems inside a Fortune 100 company 
are relational, they could not find a single instance 
of an entity-relationship modeling in over ten such 
large corporations. The lack of modeling is not due 
to the lack of complexity: they report that a typ- 
ical Fortune 100 company has about 10 thousand 
different information systems, that a typical rela- 
tional database is made of over 100 tables, each con- 
taining between 50 to 200 attributes. Formalized 
conceptual models, as well as the theory developed 
around normalization, are not used. Physical mod- 
eling is frequently delayed until performance prob- 
lems arise. In a very real way, we have entered a 
post-methodological era as far as the design of in- 
formation systems is concerned The emergence 
of the Web has coincided with the death of the dom- 
inant methods based on the analytic thought and 
lead to the emergence of sense-making as a primary 
paradigm. 

If one agrees that the traditional method is not 
used, the obvious question is: Why? Why do practi- 
tioners dismiss a method that has a solid theoretical 
basis and is the distillation of years of thought? It 
would be easy (and tempting!) to blame the design- 
ers or their training. But the tools themselves share 
a good part of the blame. They fail to give what 
designers need most, guidance as to how to apply 
them: for conceptual models, not enough guidance 
is given as how to create one, how to assess its qual- 
ity, and — importantly — how to handle all infor- 
mation that docs not fit into the conceptual model 
but may be relevant later for data quality of other 
purposes. 

A critical failure in the traditional approach is 
that there is little guidance on how to discover im- 
portant information (e.g., functional dependencies) 
in the real world. It would not be a concern if the 
rest of the design assumed that we lacked informa- 
tion. Yet unless we have all functional dependen- 
cies, there is no guarantee of normal form in the log- 
ical design. Thus, the logical design phase is brittle. 

Ironically, the step where most research has fo- 
cused o n g iving guidance is the last one, physical 
design [l3|, perhaps because it is easier to simu- 



late realistically the problems and their solutions 
in a laboratory. However, this third step relies on 
the previous ones; while it can sometimes result in 
modifications of the database schema — as when 
denormalization is recommended, most approaches 
still assume that a schema has been well designed. 
By analogy, we could say that we know how to build 
the walls, as long as the foundation of the house is, 
somehow, done properly. 

2.1.2 Failures of imagination 

Even if one were to follow the steps of the tradi- 
tional design method, and have a perfectly normal- 
ized, by-the-book database, what does one obtain? 

We consider database design a matter of seman- 
tics: we are trying to capture the semantics of a do- 
main, to represent information about that domain 
faithfully, and to (only) allow operations with the 
data that are meaningful. But traditional database 
design focuses on structure. In exchange for all the 
effort, we have insufficient semantics. This is the 
sense in which databases are under- engineered. 

Consider, for instance, the problem of informa- 
tion integration 2^, 54, IS^- Relational databases 
fail to provide enough information to determine au- 
tomatically whether two databases contain informa- 
tion about distinct, overlapping, or similar domains. 
And yet, integration of information is increasingly 
critical: 40% of the cost associated with information 



systems is due to data integration problems [llj . To 
exemplify this trend toward greater integration and 
collaboration even in the most conservative settings, 
consider that the 9/11 Commission report urged 
the intelligence community to move from its need- 
to- know standard to a need-to-share approach [ssf . 
Experts believe that the 9/11 tragedy could have 
been avoided with better data integration. The tra- 
ditional way to design databases does not capture 
enough information to enable information integra- 
tion — in fact, it falls short of capturing precisely 
the kind of information that would be more valuable 
for integration. Hence, traditional design not only 
fails to alleviate the problem, it is helping to per- 
petuate it. Most data integration approaches start 
by trying to determine the similarity between at- 
tributcslll Since most design approaches treat at- 
tributes as barely more than labels, one has usually 
only a string to work with: information about at- 

^ Several approaches rely on statistical pro perties of 
data, and choose not to try to interpret it [3g| . It is un- 
clear whether this is done in search of generality or due 
to need; but we believe that, while this approach pro- 
vides important information, statistical properties can- 
not establish semantic similarity by themselves — but see 
Halevy et al. [11] for a different viewpoint. 



tributes (metadata) is usually absent [53[- As long 
as design focuses on how to structure attributes in 
tables and not in what attributes mean, the prob- 
lem will be with us. In the end, we ask practitioners 
to follow a model that is demanding and yields, in 
return, some very limited results. 

The lack of appropriate metadata is even more 
acute in new applications, ranging from financial to 
legal systems. A prominent example is e-science: 
scientists need not only to store larger and larger 
amounts of data. They also need to be able to assess 
provenance [H^], access rights, workflows, etc. in 
order to comply with ever increasing regulations, to 
be able to share the data, and to achieve the goal 
of reproducible research [58|. On this, traditional 
design offers no guidance. 

To make matters worse, the focus on structure 
creates rigidity. Kiely and Fitzgerald [37| found 
that traditional information systems development 
methods were sometimes perceived to be of lim- 
ited use within modern projects because they are 
too cumbersome and inflexible. This is the sense in 
which databases may be considered over- engineered. 
Consider the NoSQL movement ji^. A large force 
behind it are programmers for which database de- 
sign makes no sense. Tired of the rigid structure 
of relational databases, other systems (Raven DbH 
Amazon 's SimplcDBH Apache's CouchDB|3 Mon- 
goDEEl) are emerging. What good is it to design if it 
fails to make the developers more productive? Un- 
fortunately, the mismatch between objects and pro- 
gram structures on one hand, and database struc- 
tures on the other, is still largely unresolved. Mo- 
tivated by this problem, Microsoft has proposed the 
Language-Integrated Query (LINQ) framework [44 1. 
Other initiatives to bridge the gap have been devel- 
oped over the years — witness to the fact that the 
problem is still with us. 

3. WHY DOES IT FAIL? 

The traditional design method was developed in 
the early seventies, when mainframes dominated in- 
formation technology. It is in this era that the rela- 
tional [itI and entity-relationship models [l^ were 
invented. Accordingly, there are several assump- 
tions behind the traditional design which reflect its 
age: 

• Users are faceless objects for whom (or on whose 
behalf) the systems are designed fslil^ In the seven- 

^http : / /ravendb . net 
^http : / /aws . amazon . com/ simpledb/ 
http : / /couchdb . apache . org/ 
^http : // www . mongodb . org/ 

®In the quote sometimes attributed to Frederick the 



ties, the management of data was left in the hands 
of few experts who served the needs of technolog- 
ically unsophisticated employees. Nowadays, the 
boundaries between users, whether they are employ- 
ees or clients, and developers are blurred [i^. This 
is best illustrated with how hashtags emerged on 
the microblogging platform Twitter. Hashtags are 
a metadata convention among Twitter users [soj . 
in the spirit of folksonomies (55|. Yet Twitter it- 
self had no support for metadata. We can trace 
back the current convention to a single user who 
informally proposed it in a 140-character post in 
August 2007. Later, Twitter engineers recognized 
the convention and added software support for it. 
For example. Twitter detects "trending topics" us- 
ing popular hashtags. A few other conventions, like 
the "retweet" were first initiated by the users. Sun- 
dara Nagarajan has recently expressed the same 
idea [4§|: "Empowered end users cause application 
systems to evolve at tremendous speeds and con- 
tinuously create new requirements for interopera- 
tion. For instance, a social networking site user can 
add content and pointers from a website, by simply 
dragging and dropping. The evolution of mashups 
that combine data and functionality from multi- 
ple sources is another example of this new design 
paradigm. This is leading to the evolution of the 
user experience, along with computation and data 
management." When systems are designed without 
the users, a lack of user engagement may result: 
93% of all accounts in Business Intelligence systems 
are never used (iHt . 

• The information system is strongly consistent. It 
has been estimated that Google alone has more than 
1 million servers. Using cloud computing, anyone 
can use a distributed network of servers at a modest 
cost. With multiply located servers and deeply inte- 
grated web services, the CAP theorem [25| implies 
in practice that we have to choose between strong 
consistency and strong availability: we cannot have 
both. As a possible illustration of this constraint, 
the recent failure of an Oracle database at JPMor- 
gan Chase, which froze $132 million in assets and 
lost thousands of loan applications, was blamed on 
an database |47| which required strong consistency 
for all data. 

• Semantics is absolute. The original design as- 
sumed a centralized architecture. This architectural 
assumption had a reflection on the conceptual level, 
where one main viewpoint was assumed. While se- 
mantic relativism is pointed out |2ll . l5lLl56l |. a choice 
must be made for a single model: there is no mech- 



Great, it's "everything for the people, but witliout the 
people". 



anisms to derive other. Yet when different systems 
must interact routinely, we cannot expect that they 
all share the same viewpoint. 

• The models are static. In the traditional setting, 
there is little need for evolution. Yet databases, 
even in large conventional corporations, are fast evolv- 
ing: 30% of all information systems are modified 
significantly every year [llj in Fortune 100 com- 
panies. Chen, the father of the entity-relationship 
model, recently recognized the difficulty by pointing 
out the inability of existing modeling techniques to 
cope with fast- varying world states 16[. 



are certainly papers which, without having design 
as their main goal, bring considerable contributions 
to the table. For instance, the research on datas- 
paces by Halevy et al. |27| has brought forth the 



4. WHAT NOW? 

Despite these difficulties, research on database de- 
sign has failed to make major progress in the last 
ten years or so. This is not to say that no research 
is done. For instance, the series of conferences on 
entity- relationship modeling (50| . while not totally 
focused on design issues, devote most of the pro- 
gram to them. Theoretical work is still ongoing 
in logical modeling [s^. Physical design research 
is still strong, sometimes driven by database ven- 
dors And there is still a community of dedi- 
cated researchers including notable researchers like 

among others. 



Thalheim |6l| and Olive |4c 

But overall the topic is not widely pursued. For 
instance, if one checks the last 10 years of the "ma- 
jor" database conferences (SIGMOD, VLDB, ICDE), 
the number of papers in database design is low: 
leaving aside physical design (that is, the fine tun- 
ing of storage structures for better performance) , we 
found less than ten talks mainly about database de- 
sign 0, S m Hi SI £2,] and most of them 
examine design issues within the confines of a re- 
stricted context (sensor databases 41], J42 
schema evolution user interaction. |62l 
tific databases [sl] , data warehousing j34 



XML 



^We do not claim any statistical validity for this ob- 
servation. For one, the sample used is limited — more 
conferences, and certainly some journals, should be in- 
cluded. For another, there is a subjective aspect to this 
analysis. For the sake of transparency, we explain our 
method: first, the web page of each conference, reached 
through the web site of the organization behind the con- 
ference was used: this gives access to session titles, pa- 
per titles and sometimes abstracts. A search was made 
for keyword "design", another for keyword "normal" (to 
obtain 'normalization' and so on), and another for "se- 
mantic". We checked the title and abstract for each 
match (the last keyword generated quite a few matches). 
As stated, papers on physical design (database tuning 
and index design) were excluded. The list of papers ob- 
tained is given for interested readers to judge by them- 
selves in the references above. Others may reasonably 
disagree exactly as to what to include, i.e., how to de- 



possibility of databases where the schema is im- 
plicit or at least not separated from the data and 
can evolve with it. It opens up some possibilities, 
but no paper on this project is about design per 
se. Likewise, research on scmistructured data (e.g., 
XML) has exposed the database community to the 
thought that design must be more flexible [6[ . How- 
ever, little of this seems to have percolated to more 
traditional (relational, object relational) data mod- 
els, and the design methodology for them. Finally, 
the total number of contributions remain low, even 
including this work, for such a crucial topic. 

The fact is, traditional database design is not a 
mainstream research topic. We believe that this is 
due to two main facts: first, for most researchers, 
work on database conceptual models is seen as too 
difficult, because the subject is "soft", not clearly 
formalized, and does not yield itself well to the typ- 
ical paper that one expects to see published in most 
conferences and technical journals. Second, work on 
relational design is considered useless as the topic 
is commonly taken as basically a mature and closed 
one. Certainly, there is always some more work 
that can be done (for instance, extending the idea 
of key and functional dependence to other models, 
like XML, has received some recent attention fl^ . 
as well as extending the concept itself to 'soft' func- 
tional dependencies H^]), but the subject is often 
considered "a solved problem". 

On teaching, we note that while some textbooks 
are quite good about pointing out to students the 
limits and difficulties of the process, others simply 
gloss over the issues and give the impression that 
this a "case closed" situation — which may con- 
tribute to the lack of research in the area. 

5. WHAT NEXT? 

Traditional database design fails to provide the 
tools needed to design databases in today's envi- 
ronment, but researchers have not updated or ex- 
panded the methodologies enough to keep up with 
the times. Should we continue teaching methodolo- 
gies which disappoint practitioners? 

A first step towards renewed emphasis on database 
design research is to come up with a fresh and timely 



fine 'mainly about database design' (but note that we 
include an invited talk and two tutorials!). However, 
unless one uses a generous notion of 'database design', 
we believe other people's results will be in the same or- 
der of magnitude as ours. 



approach. Different researchers will likely have dif- 
ferent viewpoints as to what are the most crucial 
or interesting problems. We submit the following 
research plan to open up a discussion. 

Design for a distributed world. 

• Wc must update database design methodolo- 
gies for new environments that did not exist 
in the 1970s. Though there were many failed 
attempts to replace the ACID-compliant re- 
lational database systems with better alterna- 
tives, the landscape has finally begun changing 
with the adoption of cloud computing. For ex- 
ample, the data consistency requirements (and 
other issues affecting distribution) should be 
made explicit during the design phase, so that 
they can be exploited when deciding an archi- 
tecture. In fact, many NoSQL designs assume 
that most operations can be kept local in or- 
der to ensure scalability, which means that one 
needs to know which data is likely to be in- 
volved in a transaction (logically related) in 
order to distribute the data ([591 makes the 
same point, implicitly). Along the same lines, 
deciding what can be made eventually consis- 
tent (versus what needs to be kept consistent 
at all times) , and what to do in the face of in- 
consistency, should be based on the semantics 
of data. Hence, such issues should be part of 
the design phase. 



• It is fashionable to talk about Big Data: one 
the main driver being this trend is our ability 
to quickly integrate diverse data sets to cre- 
ate new services. Correspondingly, easier data 
integration should become one of the primary 
goals of good database design. Another issue 
that Big Data brings is the distributed nature 
of the model. Do we need a 'distributed de- 
sign' approach? For instance, should design 
produce more or less independent modules or 
'chunks' of connected data, which can in turn 
be connected to each other in one or more 
ways? How would such a distributed design 
relate to Berners-Lee's linked data 0? Or per- 
haps wc should propose methodologies which, 
instead of starting from a clean slate, begin 
with the existing schemas (both within the or- 
ganization, and public ones) and build on top 
of them. Should wc shift the focus towards 
extensions of what there is? 



Rethink functional dependencies. 

• If Holland is right and normalization is for 
sissies ([1^), then one should question the fo- 
cus on functional dependencies in database de- 
sign. If this idea seems far-fetched, recall that 
data warehousing practitioners proposed a dif- 
ferent design methodology (the star schema) 
that docs not use the idea of functional depen- 
dency at all (rationalization of star schema in 
normal form came after the fact). The ques- 
tion then is whether there are other concept 
or concepts that can replace 'functional depen- 
dence' and be a good basis for design. 

• We know that enforcing functional dependen- 
cies in the schema is insufficient to ensure that 
the data is semantically consistent. There are 
many rules, some expressible as constraints, 
assertions, or triggers, that could be enforced 
to ensure meaningful data. But current design 
mostly ignores this information. Shouldn't we 
attempt to capture this information, which is 
most likely to have an impact on the quality 
of our data, during the design? (In which case 
we need to define a way to measure the impact 
of different types of rules in data quality and 
consistency.) How can these various rules be 
used together in design? Note that to answer 
this question one has to answer other, more 
basic, questions: how do these different rules 
interact? 

• Much of the database-design courses focus on 
functional dependence and normal form. It is 
often implied that the physical design ought 
to be a straight-forward application of the log- 
ical design. This is because, once, the equa- 
tion one relation = one table = one file held 
for virtually all relational systems. Yet it not 
longer applies. For example, many distributed 
or column-oriented database systems replicate 
data for speed or reliability. Is it time to com- 
pletely separate logical design from physical 
design, i.e. consider the relation as a purely 
conceptual entity? 

Design for imperfect knowledge. 

• We must cope with incomplete information (about 
the domain, the users, etc.) since in real sys- 
tems, the scope or boundary of a database, or 
its future usage, is often uncertain [i^. Thus, 
design should proceed with as few assumptions 

as possible. Until now, a certain closed world 



assumption mentality trickles all the way from 
the conceptual model to the database. Clearly, 
we live in an open world. Should be consider 
schemas as descriptive instead of prescriptive, 
which is what they are now? If so, what to 
do with data that does not follow the schema? 
Should any such data be allowed? Given the 
difficulty of determining in advance the type of 
data that the system may have to deal with, 
should the design include, for instance, a de- 
scription of data that should not be allowed, 
and leave the database open to all other data? 
To some extend, XML schema languages (e.g., 
XML Schema and Relax NG) seem to adopt 
such a permissive attitude, with the added re- 
quirement that the data be structured in a hi- 
erarchical manner. Unfortunately, our expe- 
rience is that the process is burdensome and 
is not widely used [l^l- Are there lightweight 
alternatives? 

• In turn, adopting an open world point of view 
will make it easier to support collaborative, 
evolutionary design as an integrated part of 
the workflow [111- The issue here is, how do we 
design databases with open-world model while 
insuring the necessary consistency? If we are 
going to give permission to users to modify a 
schema, how much freedom should users have? 
For instance, one could study whether design 
can be crowdsourced (and if so, how and un- 
der what constraints). In general, one need to 
decide what kind of changes can be support, 
whether they come from the users or from a 
designer. A deeper study of database evolu- 
tion could be of help here: could a system be 
designed that adapts its storage to changing 
schemas and requirements? Physical design 
is currently focused on query workload, that 
is, it adapts itself to the (changing) require- 
ments posed by the database queries. Gould 
some of these ideas be used to make the sys- 
tem reactive to changes in the schema? We 
find interesting that functional dependencies 
can be (roughly) classified as natural (one that 
reflects an invariant in the world: a person has 
only one height) or artificial (one that reflects 
a convention: each employee has to attend X 
meetings a month). The former are quite sta- 
ble, but the latter are subject to change (note 
that all so-called business rules are artificial!). 
Should a system be able to cope with changes 
in artificial dependencies (old ones cease to 
hold, new ones are added)? 



• New data stores in the NoSQL movement use 
non-relational data models: key/value, docu- 
ments, extensible records Probably the 
first research task for such data models is a 
clarification of their exact structure and prop- 
erties, since the terms are used somewhat loosely. 
But an immediate second is to decide whether 
they require a different approach to design (af- 
ter all, even NoSQL data stores require de- 
sign) or, to the contrary, whether design de- 
cisions can be kept independent of the data 
model. The question is not as trivial as it may 
seem: some of these new models allow an open 
schema, that is, one where the user can add 
attributes at will, while others still require, 
like relational databases, a closed schema, that 
is, one where all possible attributes are de- 
clared beforehand — yet others, like extensible 
records, combine both parts. 

• Though there has been much work done on 
probabilistic databases [l^ and soft functional 
dependencies [33 |. such subjects remain almost 
entirely distinct from database design. Yet se- 
mantics are not always absolute: some rela- 
tionships are merely almost always true. Thus, 
it is likely that there are many more soft de- 
pendencies or conditional dependencies than 
'standard' functional dependencies. (A condi- 
tional dependency is one that holds only under 
certain circumstances. For example, at some 
places, a married couple is always made of a 
man and a woman, but not at others.) Gurrent 
design practices tend to ignore all functional 
dependencies but the standard ones, which are 
but an extreme case [2^ . Should we make 
room in database-design methodologies for prob- 
abilistic metadata and several types of depen- 
dencies? If so, how would different types of 
dependencies be used? How would they be- 
have when put together? 

No doubt, different researchers will have differ- 
ent viewpoints on these issues. Some may object 
to some of the challenges included here; others may 
wish to direct attention to other problems not in- 
cluded here. We stress again that this plan is meant 
to start the discussion; let the debate begin. 
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